Text Analysis with Movie Plot Similarity

Text Mining, Clustering, Natural Language Processing

Movies have long been among the hottest topics of discussion in pop culture. Most people (from my own observation) love watching movies, and while we may love some and hate others, we often find that we prefer movies of a similar genre. Some of us love action movies, while others like horror. Some of us like movies with karate and ninjas in them, while others like superheroes. Movies within a genre often share common base parameters.

As a former action movie lover (I've since drifted towards fantasy), consider the movies 'White House Down' (2013) and 'Olympus Has Fallen' (2013): both are movies about terrorist attacks on the president, with an unlikely hero saving the day. Having watched both, I can confirm they share many similarities. We could conclude based on intuition that both of these fall into the same genre of movies, but wouldn't it be nice to determine that quantitatively, using statistics?


Aim of Project

In this project, we shall seek to quantify the similarity of movies based on their plot summaries available on IMDb and Wikipedia, then separate them into groups, also known as clusters. We'll create a dendrogram to represent how closely the movies are related to each other.


Brief Theory on NLP & Text Mining

In this project we shall make use of text mining methods, cluster analysis and natural language processing (NLP). There is considerable overlap between NLP and text mining. NLP is used to understand human language by analyzing text, speech, or grammatical syntax, while text mining is used to extract information from unstructured and structured content; they differ in their goals. NLP is often used as a component of text mining, performing a special kind of linguistic analysis that essentially helps a machine “read” text. Simply put, NLP breaks down the complexities of language, presents it to machines as data sets to reference, and extracts the intent and context needed to process it further.

NLP methods are powerful and their popularity has grown tremendously in recent years as part of artificial intelligence. Alexa and Siri, text and email autocorrect, and customer service chatbots all use machine learning algorithms and Natural Language Processing (NLP) to process, “understand”, and respond to human language, both written and spoken. Although NLP and its sister field, Natural Language Understanding (NLU), are advancing in leaps and bounds in their ability to compute words and text, human language is incredibly complex, fluid, and inconsistent, and presents serious challenges that NLP has yet to completely overcome. Some challenges of NLP are listed below:

  • Contextual Words, Phrases and Homonyms
    • The same words and phrases can have different meanings according to the context of a sentence, and many words, especially in English, have the exact same pronunciation but totally different meanings
  • Synonyms
    • These can lead to issues similar to those of contextual understanding, because we use many different words to express the same idea
  • Irony & Sarcasm
    • These present problems for machine learning models because they generally use words and phrases that, strictly by definition, may be positive or negative, but actually connote the opposite

Other challenges include ambiguity, errors in text (such as spelling errors), and slang or colloquialisms.

The information garnered and displayed here were taken from the following sources: monkeylearn, thinkml and techgig.


Plan of Development

We begin by setting up our Jupyter environment, loading the packages and reading in the data. We then quickly inspect the data to see whether cleaning is required. Much of the preprocessing of the data will be done using text mining methods: we tokenize the plots of the movies, followed by stemming. After that we create a TF-IDF matrix and perform clustering using K-means (a non-hierarchical method) and hierarchical clustering to generate a dendrogram. From that we can identify movies which fall into the same clusters, i.e. movies with similar plots.


Setup

We begin by importing the necessary modules for this project. We then read in the data used for this project, which was taken from IMDb and Wikipedia. An important module is the NLTK package. The Natural Language Toolkit (NLTK) is a platform for building Python programs that work with human language data, for use in statistical natural language processing (NLP). It contains text processing libraries for tokenization, parsing, classification, stemming, tagging and semantic reasoning.

In [2]:
# Import modules
import numpy as np
import pandas as pd
import nltk

Now let's read in the data and display it, raw.

In [3]:
# Set seed for reproducibility
np.random.seed(5)

# Read in IMDb and Wikipedia movie data (both in same file)
try:
    movies_df = pd.read_csv("datasets/movies.csv")
    print("Movie dataset has {} samples with {} features each.".format(*movies_df.shape))
except FileNotFoundError:
    print("Dataset could not be loaded. Is the dataset missing?")


print("Number of movies loaded: %s " % (len(movies_df)))

# Display the data
movies_df.head(7)
Movie dataset has 100 samples with 5 features each.
Number of movies loaded: 100 
Out[3]:
rank title genre wiki_plot imdb_plot
0 0 The Godfather [u' Crime', u' Drama'] On the day of his only daughter's wedding, Vit... In late summer 1945, guests are gathered for t...
1 1 The Shawshank Redemption [u' Crime', u' Drama'] In 1947, banker Andy Dufresne is convicted of ... In 1947, Andy Dufresne (Tim Robbins), a banker...
2 2 Schindler's List [u' Biography', u' Drama', u' History'] In 1939, the Germans move Polish Jews into the... The relocation of Polish Jews from surrounding...
3 3 Raging Bull [u' Biography', u' Drama', u' Sport'] In a brief scene in 1964, an aging, overweight... The film opens in 1964, where an older and fat...
4 4 Casablanca [u' Drama', u' Romance', u' War'] It is early December 1941. American expatriate... In the early years of World War II, December 1...
... ... ... ... ... ...
95 95 Rebel Without a Cause [u' Drama'] \n\n\n\nJim Stark is in police custody.\n\n \... Shortly after moving to Los Angeles with his p...
96 96 Rear Window [u' Mystery', u' Thriller'] \n\n\n\nJames Stewart as L.B. Jefferies\n\n \... L.B. "Jeff" Jeffries (James Stewart) recuperat...
97 97 The Third Man [u' Film-Noir', u' Mystery', u' Thriller'] \n\n\n\nSocial network mapping all major chara... Sights of Vienna, Austria, flash across the sc...
98 98 North by Northwest [u' Mystery', u' Thriller'] Advertising executive Roger O. Thornhill is mi... At the end of an ordinary work day, advertisin...
99 99 Yankee Doodle Dandy [u' Biography', u' Drama', u' Musical'] \n In the early days of World War II, Cohan ... NaN

100 rows × 5 columns

We see that our data frame contains information about what the movie is, the genre, Wikipedia plot line and IMDb plot line. The two columns titled wiki_plot and imdb_plot, reflect the relevant plot lines (i.e., they are the plot found for the movies on Wikipedia and IMDb, respectively). The text in the two columns is similar, however, they are often written in different tones and thus provide context on a movie in a different manner of linguistic expression. Further, sometimes the text in one column may mention a feature of the plot that is not present in the other column.

For example, consider the following plot extracts from The Godfather:

  • Wikipedia: "On the day of his only daughter's wedding, Vito Corleone"
  • IMDb: "In late summer 1945, guests are gathered for the wedding reception of Don Vito Corleone's daughter Connie"

While the Wikipedia plot only mentions it is the day of the daughter's wedding, the IMDb plot also mentions the year of the scene and the name of the daughter.

Let's combine both the columns to avoid the overheads in computation associated with extra columns to process.

In [5]:
# Combine wiki_plot and imdb_plot into a single column
movies_df["plot"] = movies_df["wiki_plot"].astype(str) + "\n" + \
                    movies_df["imdb_plot"].astype(str)

# Inspect the new DataFrame
movies_df.head()
Out[5]:
rank title genre wiki_plot imdb_plot plot
0 0 The Godfather [u' Crime', u' Drama'] On the day of his only daughter's wedding, Vit... In late summer 1945, guests are gathered for t... On the day of his only daughter's wedding, Vit...
1 1 The Shawshank Redemption [u' Crime', u' Drama'] In 1947, banker Andy Dufresne is convicted of ... In 1947, Andy Dufresne (Tim Robbins), a banker... In 1947, banker Andy Dufresne is convicted of ...
2 2 Schindler's List [u' Biography', u' Drama', u' History'] In 1939, the Germans move Polish Jews into the... The relocation of Polish Jews from surrounding... In 1939, the Germans move Polish Jews into the...
3 3 Raging Bull [u' Biography', u' Drama', u' Sport'] In a brief scene in 1964, an aging, overweight... The film opens in 1964, where an older and fat... In a brief scene in 1964, an aging, overweight...
4 4 Casablanca [u' Drama', u' Romance', u' War'] It is early December 1941. American expatriate... In the early years of World War II, December 1... It is early December 1941. American expatriate...

Tokens and Tokenization

Tokenization is the process by which we break down articles into individual sentences or words, as needed. Besides the tokenization method provided by NLTK, we might have to perform additional filtration to remove tokens which are entirely numeric values or punctuation.

The idea of Tokenization stems from the fact that while a program may fail to build context from "While waiting at a bus stop in 1981" (Forrest Gump), because this string would not match in any dictionary, it is possible to build context from the words "while", "waiting" or "bus" because they are present in the English dictionary. As such tokenization forms the very first step and possibly one of the most important steps in the method.

Let us perform tokenization on a small extract from The Godfather to provide an example of it at work.

In [11]:
# Tokenize a paragraph into sentences and store in sent_tokenized
sent_tokenized = [sent for sent in nltk.sent_tokenize("""
                        Today (May 19, 2016) is his only daughter's wedding.
                        Vito Corleone is the Godfather
                        """)]

print(sent_tokenized[0])
print(sent_tokenized[1])
                        Today (May 19, 2016) is his only daughter's wedding.
Vito Corleone is the Godfather

We have broken the text into sentences. Now we can further obtain tokens by breaking the sentences down into words.

In [13]:
# Word Tokenize first sentence from sent_tokenized, save as words_tokenized
words_tokenized = [word for word in nltk.word_tokenize(sent_tokenized[0])]
words_tokenized
Out[13]:
['Today',
 '(',
 'May',
 '19',
 ',',
 '2016',
 ')',
 'is',
 'his',
 'only',
 'daughter',
 "'s",
 'wedding',
 '.']
In [15]:
# Remove tokens that do not contain any letters from words_tokenized
import re
filtered = [word for word in words_tokenized if re.search("[a-zA-Z]", word)]

# Display filtered words to observe words after tokenization
filtered
Out[15]:
['Today', 'May', 'is', 'his', 'only', 'daughter', "'s", 'wedding']

We have now seen the steps taken to conduct tokenization. We initially read in the paragraph and tokenized it into sentences. We then tokenized the first sentence to get words. From there we removed the tokens that do not contain any letters. The final result (filtered) is shown.

Stemming

We now conduct stemming. This is the process by which we reduce a word from its different forms to the root word. It helps us establish meaning for different forms of the same word without having to deal with each form separately. For example, the words 'fishing', 'fished', and 'fisher' all get stemmed to the word 'fish'; likewise, the words 'care', 'cared' and 'caring' all share the stem 'care'.

Consider the following sentences:

  • "Young William Wallace witnesses the treachery of Longshanks" ~ Braveheart
  • "escapes to the city walls only to witness Cicero's death" ~ Gladiator

Instead of building separate dictionary entries for both 'witnesses' and 'witness', which mean the same thing apart from number, stemming reduces them both to 'wit'.

There are different algorithms available for stemming such as:

  • the Porter Stemmer
  • the Snowball Stemmer, etc.

We shall use the Snowball Stemmer. It is a stemming algorithm which is also known as the Porter2 stemming algorithm. Some common rules of Snowball stemming are:

ILY  -----> ILI
LY   -----> Nill
SS   -----> SS
S    -----> Nill
ED   -----> E, Nill
  • 'Nill' means the suffix is replaced with nothing, i.e. it is simply removed.

  • There are cases where these rules vary depending on the word. Take the suffix ‘ed’: the words ‘cared’ and ‘bumped’ are stemmed to ‘care‘ and ‘bump‘, so in ‘cared’ only the ‘d’ is treated as the suffix, not ‘ed’. Another interesting case is the word ‘stemmed‘, which is reduced to ‘stem‘ rather than left as ‘stemmed‘. The suffix handling therefore depends on the word. Some examples are given below:

Word        Stem
cared       care
university  univers
fairly      fair
easily      easili
singing     sing
sings       sing

Note, you can also quickly check what stem would be returned for a given word using the Snowball site. Stemming does have some drawbacks: over-stemming and under-stemming may produce stems that are not meaningful or are inappropriate. Stemming also does not consider how a word is being used. For example, the word ‘saw‘ will be stemmed to ‘saw‘ itself, with no regard to whether the word is being used as a noun or a verb in context.
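We can check the stems in the table above directly with NLTK's SnowballStemmer (a quick sketch; the stems printed are what the English Snowball algorithm produces):

```python
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

# Stem each example word from the table above, plus 'saw' from the text
for word in ["cared", "university", "fairly", "easily", "singing", "sings", "saw"]:
    print(word, "->", stemmer.stem(word))
# cared -> care
# university -> univers
# fairly -> fair
# easily -> easili
# singing -> sing
# sings -> sing
# saw -> saw
```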

In [17]:
# Import the SnowballStemmer to perform stemming
from nltk.stem.snowball import SnowballStemmer

# Create an English language SnowballStemmer object
stemmer = SnowballStemmer("english")


# Print filtered to observe the words without stemming
print("Without stemming: ", filtered)

# Stem the words from filtered and store in stemmed_words
stemmed_words = [stemmer.stem(word) for word in filtered]

# Print the stemmed_words to observe words after stemming
print("After stemming:   ", stemmed_words)
Without stemming:  ['Today', 'May', 'is', 'his', 'only', 'daughter', "'s", 'wedding']
After stemming:    ['today', 'may', 'is', 'his', 'onli', 'daughter', "'s", 'wed']

Implementing Tokenization and Stemming Together

We have established how to tokenize and stem sentences. But since we would have to use the two functions repeatedly, one after the other, to handle a large amount of data, we can wrap them in a single function and pass the text to be tokenized and stemmed as the function argument.

Then we can pass this new wrapper function, which performs both tokenizing and stemming instead of just tokenizing, as the tokenizer argument when creating the TF-IDF vector of the text. (TF-IDF, term frequency-inverse document frequency, is a statistical measure that evaluates how relevant a word is to a document in a collection of documents; more on this later.)

Recall that when we 'tokenize only' the sentence from The Godfather, "Today (May 19, 2016) is his only daughter's wedding.", we get the following result:

'Today', 'May', 'is', 'his', 'only', 'daughter', "'s", 'wedding'

But when we do a 'tokenize-and-stem' operation we get:

'today', 'may', 'is', 'his', 'onli', 'daughter', "'s", 'wed'

Using tokenization and stemming together we get words in their root form, which leads to a better establishment of meaning, as some of the non-root forms may not be present in the NLTK training corpus.

In [20]:
# Define a function to perform both stemming and tokenization
def tokenize_and_stem(text):

    # Tokenize by sentence, then by word
    tokens = [y for x in nltk.sent_tokenize(text) for y in nltk.word_tokenize(x)]

    # Filter out raw tokens to remove noise
    filtered_tokens = [token for token in tokens if re.search('[a-zA-Z]', token)]

    # Stem the filtered_tokens
    stems = [stemmer.stem(word) for word in filtered_tokens]

    return stems

words_stemmed = tokenize_and_stem("Today (May 19, 2016) is his only daughter's wedding.")
print(words_stemmed)
['today', 'may', 'is', 'his', 'onli', 'daughter', "'s", 'wed']

Create TfidfVectorizer

Since computers do not understand text, per se - they are suited to understanding numbers and performing numerical computation - we must convert our textual plot summaries to numbers for the computer to be able to extract meaning. As such, a common method in Text Mining is to count all the occurrences of each word in the entire vocabulary and return the counts in a vector. This process is done by CountVectorizer.

Raw counts, however, treat every word equally. For example, consider 'the': it appears quite frequently in almost all movie plots and will have a high count in each case, but (obviously) it isn't the theme of any of the movies.

Term Frequency-Inverse Document Frequency (TF-IDF) is one method which overcomes the shortcomings of CountVectorizer. The Term Frequency of a word is the measure of how often it appears in a document, while the Inverse Document Frequency is the parameter which reduces the importance of a word if it frequently appears in several documents.
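To make the idea concrete, here is a minimal sketch of the classic TF-IDF computation on a hypothetical three-document corpus (scikit-learn's TfidfVectorizer, which we use below, applies a smoothed variant of this formula, so its exact numbers differ):

```python
import math

# A hypothetical three-document corpus (for illustration only)
docs = [["the", "dog", "barks"],
        ["the", "cat", "meows"],
        ["the", "dog", "runs"]]

def tf(term, doc):
    # Term frequency: share of the document's words that are this term
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Inverse document frequency: terms found in fewer documents score higher
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_containing)

# 'the' appears in every document, so its IDF (and hence TF-IDF) is zero
print(tf("the", docs[0]) * idf("the", docs))              # 0.0
# 'cat' appears in only one document, so it carries weight there
print(round(tf("cat", docs[1]) * idf("cat", docs), 3))    # 0.366
```

Frequent-everywhere words are driven to zero while distinctive words keep their weight, which is exactly the behaviour we want for plot summaries.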

For example, when we apply the TF-IDF on the first 3 sentences from the plot of The Wizard of Oz, we are told that the most important word there is 'Toto', the pet dog of the lead character. This is because the movie begins with 'Toto' biting someone due to which the journey of Oz begins.

So ultimately, TF-IDF recognizes words which are unique and important to a given document. Let's create one for our purposes.

In [22]:
# Import TfidfVectorizer to create TF-IDF vectors
from sklearn.feature_extraction.text import TfidfVectorizer

# Instantiate TfidfVectorizer object with stopwords and tokenizer
# parameters for efficient processing of text
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                 min_df=0.2, stop_words='english',
                                 use_idf=True, tokenizer=tokenize_and_stem,
                                 ngram_range=(1,3))

Once we create a TF-IDF Vectorizer, we must fit the text to it and then transform the text to produce the corresponding numeric form of the data which the computer will be able to understand and derive meaning from. To do this, we use the fit_transform() method of the TfidfVectorizer object.

If we observe the TfidfVectorizer object we created, we come across the parameter stop_words.

  • 'stopwords' are those words in a given text which do not contribute considerably towards the meaning of the sentence and are generally grammatical filler words.
  • For example, in the sentence 'Dorothy Gale lives with her dog Toto on the farm of her Aunt Em and Uncle Henry', we could drop the words 'her' and 'the', and still have a similar overall meaning to the sentence. Thus, 'her' and 'the' are stopwords and can be conveniently dropped from the sentence.

On setting stop_words to 'english', we direct the vectorizer to drop all stopwords from a pre-defined list of English-language stopwords built into scikit-learn. Another parameter, ngram_range, defines the length of the n-grams to be formed while vectorizing the text.

In [23]:
# Fit and transform the tfidf_vectorizer with the "plot" of each movie
# to create a vector representation of the plot summaries
tfidf_matrix = tfidf_vectorizer.fit_transform([x for x in movies_df["plot"]])

print(tfidf_matrix.shape)
/Users/pavansingh/opt/anaconda3/lib/python3.9/site-packages/sklearn/feature_extraction/text.py:396: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['abov', 'afterward', 'alon', 'alreadi', 'alway', 'ani', 'anoth', 'anyon', 'anyth', 'anywher', 'becam', 'becaus', 'becom', 'befor', 'besid', 'cri', 'describ', 'dure', 'els', 'elsewher', 'empti', 'everi', 'everyon', 'everyth', 'everywher', 'fifti', 'forti', 'henc', 'hereaft', 'herebi', 'howev', 'hundr', 'inde', 'mani', 'meanwhil', 'moreov', 'nobodi', 'noon', 'noth', 'nowher', 'onc', 'onli', 'otherwis', 'ourselv', 'perhap', 'pleas', 'sever', 'sinc', 'sincer', 'sixti', 'someon', 'someth', 'sometim', 'somewher', 'themselv', 'thenc', 'thereaft', 'therebi', 'therefor', 'togeth', 'twelv', 'twenti', 'veri', 'whatev', 'whenc', 'whenev', 'wherea', 'whereaft', 'wherebi', 'wherev', 'whi', 'yourselv'] not in stop_words.
  warnings.warn(
(100, 564)

K-Means Clustering

To determine how closely one movie is related to another with the help of unsupervised learning, we can use clustering techniques. Clustering is the method of grouping together a number of items such that items in a group exhibit similar properties. Depending on the measure of similarity desired, a given sample of items can form one or more clusters.

A good basis for clustering in our dataset could be the genre of the movies. Say we could have a cluster '0' which holds movies of the 'Drama' genre. We would expect movies like Chinatown or Psycho to belong to this cluster. Similarly, the cluster '1' in this project holds movies which belong to the 'Adventure' genre (Lawrence of Arabia and Raiders of the Lost Ark, for example).

K-means is an algorithm which helps us implement clustering in Python. The name derives from its method of implementation: the given sample is divided into K clusters, where each cluster is denoted by the mean of all the items lying in it. K-means is a non-hierarchical clustering method (unlike hierarchical methods, it does not form new clusters by merging or splitting existing ones); it aims to assign objects to a user-defined number of clusters (k) in a way that maximizes the separation between the clusters while minimizing the distance of the objects within a cluster from the cluster's mean.

In [25]:
# Import k-means to perform clusters
from sklearn.cluster import KMeans

# Create a KMeans object with 5 clusters and save as km
km = KMeans(n_clusters=5)

# Fit the k-means object with tfidf_matrix
km.fit(tfidf_matrix)

clusters = km.labels_.tolist()

# Create a column cluster to denote the generated cluster for each movie
movies_df["cluster"] = clusters

# Display number of films per cluster (clusters from 0 to 4)
print("Clusters with Counts: \n",movies_df['cluster'].value_counts())
Clusters with Counts: 
 2    61
0    21
1    10
3     5
4     3
Name: cluster, dtype: int64

Similarity Distance

Consider the following two sentences from the movie The Wizard of Oz:

"they find in the Emerald City"

"they finally reach the Emerald City"

If we put the above sentences into a CountVectorizer, the vocabulary produced would be "they, find, in, the, Emerald, City, finally, reach" and the vectors for each sentence would be as follows:

1, 1, 1, 1, 1, 1, 0, 0

1, 0, 0, 1, 1, 1, 1, 1

When we calculate the cosine of the angle formed between the vectors represented above, we get a score of 0.667, which means the above sentences are closely related. Similarity distance is 1 - cosine similarity. It follows that if two vectors point in the same direction, the cosine of their angle is 1, and hence the distance between them is 1 - 1 = 0.
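We can verify this score with a quick sketch using scikit-learn (CountVectorizer lowercases the text, but that does not change the counts here):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = ["they find in the Emerald City",
             "they finally reach the Emerald City"]

# Build the count vectors and compute the cosine similarity between them
vectors = CountVectorizer().fit_transform(sentences)
similarity = cosine_similarity(vectors)[0, 1]

print(round(similarity, 3))        # 0.667
print(round(1 - similarity, 3))    # the similarity distance: 0.333
```

The two sentences share four of the six words in each vector, giving 4/6 ≈ 0.667.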

Let's calculate the similarity distance for all of our movies.

In [26]:
# Import cosine_similarity to calculate similarity of movie plots
from sklearn.metrics.pairwise import cosine_similarity

# Calculate the similarity distance
similarity_distance = 1 - cosine_similarity(tfidf_matrix)

Hierarchical Clustering & Dendrograms

We shall now create a tree-like diagram (called a dendrogram) of the movie titles to help us understand the level of similarity between them visually. Dendrograms help visualize the results of hierarchical clustering, which is an alternative to k-means clustering. Two pairs of movies at the same level of hierarchical clustering are expected to have similar strength of similarity between the corresponding pairs of movies.

For example, the movie Fargo would be as similar to North by Northwest as the movie Platoon is to Saving Private Ryan, given that both pairs sit at the same level of the hierarchy.

Hierarchical clustering (HC) is an unsupervised clustering technique that groups similar objects into clusters, built up in a predefined order. The endpoint is a set of clusters, where each cluster is distinct from every other cluster, and the objects within each cluster are broadly similar to each other. A dendrogram, simply, is a diagram that shows the hierarchical relationship between objects. It provides a highly interpretable, complete description of the hierarchical clustering in a graphical format.

Let's import the modules we'll need to create our dendrogram.

In [27]:
# Import matplotlib.pyplot for plotting graphs
import matplotlib.pyplot as plt

# Configure matplotlib to display the output inline
%matplotlib inline

# Import modules necessary to plot dendrogram
from scipy.cluster.hierarchy import linkage, dendrogram

We shall plot a dendrogram of the movies, whose similarity measure will be given by the similarity distance we previously calculated. The lower the similarity distance between any two movies, the lower on the y-axis their linkage will sit. For instance, the lowest dendrogram linkage we shall discover will be between the movies It's a Wonderful Life and A Place in the Sun, indicating that these movies are very similar to each other in their plots.

In [28]:
# Create mergings matrix. similarity_distance is a square distance matrix,
# while scipy's linkage expects a condensed distance vector, so we convert
# it with squareform first
from scipy.spatial.distance import squareform
mergings = linkage(squareform(similarity_distance, checks=False), method='complete')

# Plot the dendrogram, using title as label column
dendrogram_ = dendrogram(mergings,
               labels=[x for x in movies_df["title"]],
               leaf_rotation=90,
               leaf_font_size=16,
)

# Adjust the plot
fig = plt.gcf()
_ = [lbl.set_color('r') for lbl in plt.gca().get_xmajorticklabels()]
fig.set_size_inches(120, 25)

# Show the plotted dendrogram
plt.show()

So for example, if we ask ourselves which movie is most similar to the movie Braveheart, we can look at the dendrogram and identify that the answer is Gladiator.